\section{Our technique}

The work described in this paper is part of the \cleavir{} compiler
framework.  \cleavir{} is currently part of the \sicl{} project%
\footnote{https://github.com/robert-strandh/SICL}, but we may turn it
into an independent project in the future.

In our compiler, source code is first converted to an \emph{abstract
  syntax tree}.  In such a tree, lexical variables and lexical
function names have been converted to unique objects.  When a globally
defined function $F$ is inlined into another function $G$, we
incorporate the abstract syntax tree of $F$ as if it were a local
function in $G$.  No alpha renaming is required.  Notice that this
step in itself does not count as inlining.  The function $F$ is still
invoked using the normal function-call protocol at this stage.

In the second phase, the abstract syntax tree is translated to
intermediate code in the form of a flow graph of instructions.  Our
inlining technique is designed to work on this intermediate
representation.

There are several advantages of using this intermediate representation
over higher-level ones such as source code or abstract
syntax trees, as we will show in greater detail below, namely:

\begin{itemize}
\item Each iteration of the algorithm defined by our technique is very
  simple, and we can be shown to preserve the semantics of the
  program.
\item Because each iteration preserves the semantics, the process can
  be interrupted at any point in time, resulting in a \emph{partial}
  inlining of the called function.
\end{itemize}

Furthermore, this intermediate code representation is similar to the
one used in many traditional compiler optimization techniques, making
it possible to reuse code for similar transformations.

One potential drawback of this representation is that operations on
programs represented this way are inherently imperative, i.e. they
modify the structure of the flow graph.  The use of techniques from
functional programming is therefore difficult or impractical with this
representation.  Moreover, the flow graph resulting from some
arbitrary number of iterations of our technique does not necessarily
have any correspondence as \commonlisp{} source code.

\subsection{Intermediate code}

The intermediate code on which our technique is designed to work is
called \emph{High-level Intermediate Representation}, or \emph{HIR}
for short.  This representation takes the form of a \emph{flow graph}
of \emph{instructions} as used by many traditional compiler
optimization techniques.  The main difference between HIR and the
intermediate representation used in compilers for lower-level
languages is that in HIR, the only data objects that the instructions
manipulate are \commonlisp{} objects.  Arbitrary computations on
addresses are exposed in a later stage called \emph{Medium-level
  Intermediate Representation}, or \emph{MIR}.

Most HIR instructions correspond directly to \commonlisp{} operators
such as the ones in the categories described below.  Notice that,
although the names of the instructions often resemble the names of
\commonlisp{} operators, the instruction typically requires more
precise objects than the corresponding \commonlisp{} operator does.
Thus, the \texttt{car} instruction requires the argument to be a
\texttt{cons} object, and the \texttt{funcall} instruction requires its
first argument to be a function.  The following such categories exist:

\begin{itemize}
\item Low-level accessors such as \texttt{car}, \texttt{cdr},
  \texttt{rplaca}, \texttt{rplacd}, \texttt{aref}, \texttt{aset},
  \texttt{slot-read}, and \texttt{slot-write}.
\item Instructions for low-level arithmetic on, and comparison of,
  floating-point numbers and fixnums.
\item Instructions for testing the type of an object.
\item Instructions such as \texttt{funcall}, \texttt{return}, and
  \texttt{unwind} for handling function calls and returns.
\end{itemize}

Two of the HIR instructions are special in that they do not have direct
corresponding \commonlisp{} operators, and in that they are essential
to the inlining machinery described in this paper:

\begin{itemize}
\item The \texttt{enter} instruction.  This instruction is the first
  one to be executed in a function, and it is responsible for creating
  the initial local lexical environment of the function from the
  arguments given by the calling function.  This initial environment
  is typically augmented by temporary lexical variables during the
  execution of the function.  Variables may also be eliminated from
  the local environment when they are no longer accessible by any
  execution path.
\item The \texttt{enclose} instruction.  This instruction takes the
  \emph{code} of a nested function (represented by its \texttt{enter}
  instruction) and creates a \emph{callable function} that may be a
  \emph{closure}.
\end{itemize}

\subsection{Algorithm}

The algorithm that implements our technique maintains a
\emph{worklist}.  An item%
\footnote{In the code, an item also contains an \texttt{enclose}
  instruction, but we omit this instruction from our description, in
  order to simplify it.}
of the worklist contains:

\begin{itemize}
\item A \texttt{funcall} instruction, representing the call site in the
  calling function.
\item An \texttt{enter} instruction, representing the called function.
\item The successor instruction of the \texttt{enter} instruction,
  called the \emph{target instruction}, or \emph{target} for short.
  The target instruction is the one that is a candidate for inlining, and it
  is used for generic dispatch.
\item A mapping from lexical variables in the called function that
  have already been duplicated in the calling function.
\end{itemize}

In addition to the contents of the worklist items, our algorithm
maintains the following global information, independent of any
worklist item:

\begin{itemize}
\item A mapping from instructions in the called function that have
  already been inlined, to the corresponding instructions in the
  calling function.  This information prevents an instruction from
  being inlined more than once.  Without this information, and in the
  presence of loops in the called function, our inlining algorithm
  would go into an infinite computation.
\item Information about the ownership of lexical variables referred to
  by the called function.  This ownership information indicates
  whether a lexical variable is created by the called function itself,
  or by some enclosing function.  When an instruction to be inlined
  refers to a variable that is created by some enclosing function, the
  reference is maintained without modification.  When the reference is
  to a variable created by the function itself, the inlined
  instruction must refer to the corresponding variable in the calling
  function instead.
\end{itemize}

Prior to algorithm execution, assignment instructions are inserted
before the \texttt{funcall} instruction, copying each argument to a
temporary lexical variable.  These lexical variables represent a copy
of the initial environment of the called function, but allocated in
the calling function.  The pair consisting of the \texttt{funcall} and
the \texttt{enter} instruction can be seen as transferring this
environment from the calling function to the called function.  The
variable correspondences form the initial lexical variable mapping to
be used in the algorithm.

Initially, the worklist contains a single worklist item with the
following contents:

\begin{itemize}
\item The \texttt{funcall} instruction representing the call that
  should be inlined.
\item A \emph{private copy} of the initial \texttt{enter} instruction
  of the function to inline.
\item The successor instruction of the initial \texttt{enter}
  instruction, which is the initial target.
\item The initial lexical variable mapping described previously.
\end{itemize}

In each iteration of the algorithm, a worklist item is removed from
the worklist, and a generic function is called with four arguments,
representing the contents of the worklist item.  Each iteration may
result in zero, one, or two new worklist items, according to the
mappings and ownership information, and according to the number of
successors of the target instruction in this contents.

When the generic function is called in each iteration, one of the
following four rules applies.  As we show in \refSec{sec-correctness},
each of the following rules preserves the overall operational
semantics of the code:

\begin{enumerate}
\item If the target instruction has already been inlined, i.e. it is
  in the mapping containing this information as described previously,
  then replace the \texttt{funcall} instruction by the inlined version
  of the target.  There are two ways of doing this replacement.
  Either the predecessors of the \texttt{funcall} instruction are
  redirected to the inlined version of the target instruction,
  effectively making the \texttt{funcall} instruction unreachable, or
  else, the funcall instruction is replaced by a \texttt{no-operation}
  instruction with the inlined version of the target instruction as
  its successor.  When this rule applies, no new item is added to the
  worklist.
\item If the target instruction is a \texttt{return} instruction, then
  replace the \texttt{funcall} instruction by one or more assignment
  instructions mapping inputs of the \texttt{funcall} instruction to
  outputs of that same instruction.  Again, in this case, no new item
  is added to the worklist.
\item If the target instruction has a single successor, insert a copy
  of the next instruction before the \texttt{funcall} instruction, and
  make the \texttt{enter} instruction refer to that successor.  Update
  the mappings, the inputs of the \texttt{funcall} instruction, and
  the outputs of the \texttt{enter} instruction as described below.
  In this case, the \texttt{funcall} instruction, the \texttt{enter}
  instruction, the new successor of the \texttt{enter} instruction,
  and the updated lexical variable mapping are inserted as a new item
  on the worklist for later processing.
\item If the target instruction has two successors, insert a copy of
  the target instruction before the \texttt{funcall} instruction, and
  replicate the \texttt{funcall} instruction in each branch.  Also
  replicate the \texttt{enter} instruction so that each replica refers
  to a different successor of the original instruction.  Update the
  mappings, the inputs of the \texttt{funcall} instruction, and the
  outputs of the \texttt{enter} instruction as described below.  In
  this case, two new items are inserted on the worklist for later
  processing.  Each item contains a \texttt{funcall} instruction, an
  \texttt{enter} instruction, the successor of the \texttt{enter}
  instruction, and a lexical variable mapping, corresponding to each
  successor branch of the inlined instruction.
\end{enumerate}

For rules $3$ and $4$, when a new instruction is inlined, the mappings,
the inputs to the \texttt{funcall} instruction, and the outputs of the
\texttt{enter} instruction are updated as follows:

\begin{itemize}
\item An entry is created in the mapping from instructions in the
  called function to instructions in the calling function, containing
  the inlined instruction and its copy in the calling function.
\item If some input \texttt{i} to the inlined instruction is present
  in the lexical variable mapping (mapping to (say) \texttt{ii} in the
  calling function) and in the outputs of the \texttt{enter}
  instruction, but \texttt{i} is no longer live after the inlined
  instruction, then the entry \texttt{ii - i} is eliminated from the
  mapping, \texttt{i} is eliminated from the outputs of the
  \texttt{enter} instruction, and \texttt{ii} is eliminated from the
  inputs to the \texttt{funcall} instruction.  It would be
  semantically harmless to leave it intact, but it might harm
  performance if the inlining procedure is stopped when it is still
  partial.  Notice that, when an instruction with two successors is
  inlined, variable liveness may be different in the two successor
  branches.
\item If some output \texttt{o} of the inlined instruction is a new
  variable that is created by that instruction, then we proceed as
  follows.  Let \texttt{I} be the instruction in the called function
  that has been inlined, and let \texttt{II} be the copy of \texttt{I}
  in the calling function.  We create a new variable \texttt{oo} in
  the calling function that takes the place of \texttt{o} in
  \texttt{II}.  We add \texttt{oo} as an input to the \texttt{funcall}
  instruction, \texttt{o} as an output of the \texttt{enter}
  instruction, and we add \texttt{oo - o} to the lexical variable
  mapping.  Again, if the inlined instruction has two successors, the
  lexical variable mapping may have to be updated for one or the other
  or both of the successors.
\end{itemize}

\subsection{Example}

As an example of our technique, consider the initial instruction graph
in \refFig{fig41}.  On the left is the calling function.  It has three
lexical variables, namely \texttt{x}, \texttt{a}, and \texttt{y}.  The
variable \texttt{a} is referenced by the called function, but it is
owned by the calling function.  The called function has a single
variable named \texttt{z} in its initial lexical environment.  A
temporary variable \texttt{w} is created as a result of the execution
of one of the instructions in the called function.

\begin{figure}
\begin{center}
\inputfig{fig41.pdf_t}
\end{center}
\caption{\label{fig41}
Initial instruction graph.}
\end{figure}

Before the inlining procedure is started, we create temporary
variables in the calling function for the variables in the initial
environment of the called function.  We also create a private copy of
the \texttt{enter} instruction so that we can mutate it during the
inlining procedure.  The result is shown in \refFig{fig42}.

\begin{figure}
\begin{center}
\inputfig{fig42.pdf_t}
\end{center}
\caption{\label{fig42}
Instruction graph after initialization.}
\end{figure}

As we can see in \refFig{fig42}, an assignment instruction has been
created that copies the value of the lexical variable \texttt{x} into
a variable \texttt{zz} that mirrors the initial lexical variable
\texttt{z} in the called function.  We also see that there are now two
identical \texttt{enter} instructions.  The one labeled
\texttt{enterA} is the private copy.

Step one of the inlining procedure consists of inlining the successor
of our private \texttt{enter} instruction, i.e. the instruction
labeled \texttt{1} in \refFig{fig42}.  That instruction has a single
successor, and it has not yet been inlined.  Therefore, rule 3
applies, so we insert a copy of that instruction before the
\texttt{funcall} instruction.  Furthermore, since the input to the
original instruction is the lexical variable \texttt{z}, and that
variable is mapped to \texttt{zz} in the calling function, the inlined
instruction receives \texttt{zz} as its input.  The output of the
original instruction is the temporary variable \texttt{w} that is not
in our lexical variable mapping.  Therefore, a temporary variable
\texttt{ww} is created in the calling function, and an entry is
created in the mapping that translates \texttt{w} to \texttt{ww}.  The
private \texttt{enter} instruction (labeled \texttt{enterA}) is
modified so that it now refers to the next instruction to be
considered as a target.  The result of this step is shown in
\refFig{fig43}.

\begin{figure}
\begin{center}
\inputfig{fig43.pdf_t}
\end{center}
\caption{\label{fig43}
Instruction graph after one inlining step.}
\end{figure}

In step two of the inlining procedure, we are considering inlining an
instruction with two successors, i.e. the one labeled \texttt{2} in
\refFig{fig43}.  It has not yet been inlined, so rule number 4
applies.  As rule number 4 stipulates, we must replicate both the
\texttt{enter} instruction and the \texttt{funcall} instruction.  The
result is shown in \refFig{fig44}.

\begin{figure}
\begin{center}
\inputfig{fig44.pdf_t}
\end{center}
\caption{\label{fig44}
Instruction graph after two inlining steps.}
\end{figure}

In \refFig{fig44}, the \texttt{funcall} instruction labeled
\texttt{funcallA} is paired with the \texttt{enter} instruction
labeled \texttt{enterA} and the \texttt{funcall} instruction labeled
\texttt{funcallB} is paired with the \texttt{enter} instruction
labeled \texttt{enterB}.

In step three of the inlining procedure, we consider the
\texttt{funcall} instruction labeled \texttt{funcallB}.  The
corresponding \texttt{enter} instruction has a \texttt{return}
instruction as its successor, so rule number 2 applies.  We must
therefore replace the \texttt{funcall} instruction by an assignment
instruction, assigning the value of the variable \texttt{ww} to the
variable \texttt{y}.  The result of this operation is shown in
\refFig{fig45}.

\begin{figure}
\begin{center}
\inputfig{fig45.pdf_t}
\end{center}
\caption{\label{fig45}
Instruction graph after three inlining steps.}
\end{figure}

In step four of the inlining procedure, we consider the
\texttt{funcall} instruction labeled \texttt{funcallA} in
\refFig{fig45} and the corresponding \texttt{enter} instruction. The
successor of the \texttt{enter} instruction is the instruction labeled
\texttt{1}, and that instruction has already been inlined, so rule
number 1 applies.  We therefore remove the \texttt{funcall} and
redirect its predecessors to the inlined version of the instruction
labeled \texttt{1}.  The result is shown in \refFig{fig46}, and that
completes the inlining procedure.

\begin{figure}
\begin{center}
\inputfig{fig46.pdf_t}
\end{center}
\caption{\label{fig46}
Instruction graph after four inlining steps.}
\end{figure}

After some minor reorganization of the instructions in \refFig{fig46},
we obtain the final result shown in \refFig{fig47}.  Clearly we have an
inlined version of the called function now replicated in the calling
function.

\begin{figure}
\begin{center}
\inputfig{fig47.pdf_t}
\end{center}
\caption{\label{fig47}
Final instruction graph.}
\end{figure}

\subsection{Correctness of our technique}
\label{sec-correctness}

In order to prove total correctness of our technique, we must show
that two conditions hold:

\begin{enumerate}
\item Partial correctness, i.e. the technique must preserve the
  semantics of the program.
\item Termination.
\end{enumerate}

\subsubsection{Partial correctness}

Our technique preserves a very strong version of the semantics of the
program, namely the \emph{operational} semantics.  This fact makes
it unnecessary to create a precise definition of the program
semantics, as might have been the case for some weaker type of
semantics.  Instead, we only need to show that the exact same
operations are performed before and after each inlining step.

After a copy of the initial environment of the called function has
been made in the environment of the calling function, we can see a
pair of \texttt{funcall/enter} instructions as defining a morphism
$\sigma$, mapping the copy of this environment in the calling function
to its original version in the called function.  The inputs of the
\texttt{funcall} instruction are mapped to the outputs of the
\texttt{enter} instruction.  The lexical variable mapping used in our
technique is simply the inverse, i.e $\sigma^{-1}$ of this morphism.
Similarly, a pair of \texttt{return/funcall} instructions can be seen
as defining a morphism $\tau$, mapping the environment in the called
function to the environment in the calling function.  The inputs of
the \texttt{return} instruction are mapped to the outputs of the
\texttt{funcall} instruction.  These morphisms are illustrated in an
example of an initial situation in \refFig{f-semantics1}.

Applying rule $3$ or rule $4$ copies one instruction from the called
function to the calling function, applying the morphism $\sigma^{-1}$
to its inputs and outputs.  Two applications of rule $3$ from the
initial situation are illustrated in \refFig{f-semantics2} and
\refFig{f-semantics3}.  Applying rule $4$ is a bit more involved, but
the same mechanism is used.  As we can see from these figures, thanks
to the morphism, the instructions operate the same way whether inlined
or not.  The semantics are thus the same in both cases.

When rule $2$ is applied, the \texttt{return} instruction is not
copied.  Instead, a number of assignment instructions are created in
the calling function.  Together, these assignment instructions define
the composition of the two morphisms $\tau$ and $\sigma$, i.e. $\tau
\circ \sigma$.  Applying this rule therefore does not alter the
semantics of the program.  It merely maps the returned values to their
copies in the calling function.  Applying this rule is illustrated in
\refFig{f-semantics4}.

Finally, applying rule $1$ merely avoids the control transfer from the
calling function to the called function, by replacing the
\texttt{funcall} instruction by an existing copy of the instruction
that would have been inlined by rule $3$ or rule $4$.  The existing
copy obviously already operates in the environment of the calling
function.

\subsubsection{Termination}

In order to prove termination, we invent a metric with the following
properties:

\begin{itemize}
\item It has a lower bound on its value.
\item Its value decreases with each iteration of our inlining
  procedure.
\end{itemize}

The metric we have chosen for this purpose is called \emph{remaining
  work}, and it is represented as a pair $r = (I,F)$ where $I$ is the
number of instructions that have yet to be inlined, and $F$ is the
number of \texttt{funcall} instructions that have yet to be processed
as part of the worklist items.  Clearly, it has a lower bound on its
value, namely $r_{min} = (0,0)$.

Initially, the remaining work has the value $r_0 = (N,1)$ where $N$ is
the number of instructions in the called function.  We consider the
metric to be lexicographically ordered by its components, i.e. $(I_1,
F_1) < (I_2, F_2)$ if and only if either $I_1 < I_2$ or $I_1 = I_2$
and $F_1 < F_2$.  We show that each step yields a value that is
strictly smaller than before the step.

Consider some iteration $k$ of our inlining procedure, so that $r_k =
(I_k,F_k)$ is the remaining work before the iteration, and $r_{k+1} =
(I_{k+1},F_{k+1})$ is the remaining work after the iteration.

\begin{itemize}
\item If rule number 1 applies, then one \texttt{funcall} instruction
  is eliminated in the iteration, so that $I_{k+1} = I_k$ and $F_{k+1}
  = F_k-1$.  Clearly, $r_{k+1} < r_k$ in this case.
\item If rule number 2 applies, then again one \texttt{funcall}
  instruction is eliminated in the iteration, so that $I_{k+1} = I_k$
  and $F_{k+1} = F_k-1$.  Again, $r_{k+1} < r_k$.
\item If rule number 3 applies, then another instruction is inlined,
  but the number of \texttt{funcall} instructions remains the same, so
  that $I_{k+1} = I_k-1$ and $F_{k+1} = F_k$.  Again, $r_{k+1} < r_k$.
\item Finally, if rule number 4 applies, then another instruction is inlined,
  but the number of \texttt{funcall} instructions increases by $1$, so
  that $I_{k+1} = I_k-1$ and $F_{k+1} = F_k+1$.  Again, $r_{k+1} < r_k$.
\end{itemize}

\begin{figure}
\centering{
\inputtex{fig-semantics1.tex}
\caption{\label{f-semantics1}
Initial situation.}
}
\end{figure}

\begin{figure}
\centering{
\inputtex{fig-semantics2.tex} 
\caption{\label{f-semantics2}
Situation after one application of rule 3.}
}
\end{figure}

\begin{figure}
\centering{
\inputtex{fig-semantics3.tex}
\caption{\label{f-semantics3}
Situation after two applications of rule 3.}
}
\end{figure}

\begin{figure}
\centering{
\inputtex{fig-semantics4.tex}
\caption{\label{f-semantics4}
Situation after an applications of rule 2.}
}
\end{figure}

%%  LocalWords:  worklist funcall inlining inlined lexicographically
%%  LocalWords:  morphism morphisms
